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Abstract 

We introduce and demonstrate a new ap¬ 
proach to inference in expressive probabilis¬ 
tic programming languages based on particle 
Markov chain Monte Carlo. Our approach 
is simple to implement and easy to paral¬ 
lelize. It applies to Turing-complete proba¬ 
bilistic programming languages and supports 
accurate inference in models that make use 
of complex control flow, including stochas¬ 
tic recursion. It also includes primitives 
from Bayesian nonparametric statistics. Our 
experiments show that this approach can 
be more efficient than previously introduced 
single-site Metropolis-Hastings methods. 

1 Introduction 

Probabilistic programming differs substantially from 
traditional programming. In particular, probabilistic 
programs are written with parts not fixed in advance 
that instead take values generated at runtime by ran¬ 
dom sampling procedures. Inference in probabilistic 
programming characterizes the conditional distribu¬ 
tion of such variables given observed data assumed 
to have been generated by executing the probabilistic 
program. Exploring the joint distribution over pro¬ 
gram execution traces that could have generated the 
observed data using Markov chain sampling techniques 
is one way to produce such a characterization. 

We propose a novel combination of ideas from prob¬ 
abilistic programming [3] and particle Markov chain 
Monte Carlo (PMCMC) [T] that yields a new scheme 
for exploring and characterizing the set of probable ex¬ 
ecution traces. As our approach is based on repeated 
simulation of the probabilistic program, it is easy to 
implement and parallelize. We show that our approach 

This paper, which originally appeared in the Proceedings of 
the 17*^ International Conference on Artificial Intelligence 
and Statistics (AISTATS) 2014, has been updated to reflect 
changes in the latest version of the Anglican language. 


supports accurate inference in models that make use of 
complex control flow, including stochastic recursion, as 
well as primitives from nonparametric Bayesian statis¬ 
tics. Our experiments also show that this approach can 
be more efficient than previously introduced single-site 
Metropolis-Hastings (MH) samplers [T3] . 


2 Language 

The probabilistic programming system AnglicarQ ex¬ 
ists in two versions. The results in this paper were ob¬ 
tained using what is now called interpreted Anglicar0 
which employed an interpreted execution model and a 
language syntax derived from the Ventur^ modeling 
language. Since the time of original publication, both 
the syntax and execution model have been changed. 
The Venture style syntax is now deprecated, with the 
new version using a syntax much closer to that of 
Anglican’s host language Clojure. This new Angli¬ 
can versiorj^ simply called Anglican, is a compiled 
language; a continuation-passing style (CPS) transfor¬ 
mation compiles Anglican programs into Clojure pro¬ 
grams that are then subsequently compiled to Java 
Virtual Machine (JVM) bytecode by the Clojure just- 
in-time compiler. 

The change in language execution model and syntax 
affect neither the substance or the claims of this pa¬ 
per, however, readers wishing to experiment with the 
language starting from the code examples appearing in 
Section m would do well to note this evolution. What 
does change is the absolute time required to perform 
inference in the given models. In general, the latest 
compiled version of Anglican is ten to one hundred 
times faster than its interpreted ancestor. 


^http://www.robots.ox.ac.uk/~fwood/anglican/ 
^http://bitbucket.org/probprog/ 
interpreted-anglican 

■’http: //probcomp. csail .mit. edu/venture/ 

^http://bitbucket.org/probprog/anglican 
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2.1 Original Venture-Style Syntax 

The original, deprecated Anglican syntax is a 
Scheme/Lisp-dialect that is extended with three top- 
level special forms that we refer to as directives 

[assume symbol <expr>] 

[observe <expr> <const>] 

[predict <expr>] 

Here, each <expr> is a Scheme/Lisp-syntax expression, 
symbol is a unique symbol, and <const> is a constant¬ 
valued deterministic expression. 

Semantically assume’s are (random) variable (genera¬ 
tive) declarations, observe ’s condition the distribution 
of assume’d variables by (noisily) constraining the out¬ 
put values of (random) functions of assume’d variables 
to match observed data, and predict’s are “watches” 
which report on (via print out) the values of variables 
in program traces as they are explored. In Anglican, 
probabilistic program interpretation is taken to be a 
forever-continuing exploration of the space of execu¬ 
tion traces that obey (where hard) or reflect (where 
soft) the observe’d constraints in order to report func¬ 
tions (via predict’s) of the conditional distribution of 
subsets of the assume’d variables. 

Like Scheme/Lisp, Anglican eagerly and exchange- 
ably recursively “evaluates” subexpressions, for in¬ 
stance of <expr> = (<proc> <arg> ... <arg>), before 
“applying” the procedure (which may be a random 
procedure or special form) resulting from evaluating 
the first expression <proc> to the value of its argu¬ 
ments. Anglican counts applications, reported as a 
computational cost proxy in Sec Anglican sup¬ 
ports several special forms, notably (lambda (<arg> 
... <arg>) <body>)) which allows creation of new pro¬ 
cedures and (if <pred> <cons> <alt>) which supplies 
branching control flow; also begin, let, define, quote, 
and cond. Anglican exposes eval and apply. Built-in 
deterministic procedures include list, car, cdr, cons, 
mem, etc., and arithmetic procedures +, -, \, etc. 

All randomness in the language originates from built- 
in “random primitives” of which there are two types. 
Elementary random primitives, such as poisson, gamma, 
flip, discrete, categorical, and normal, generate 
independent and identically distributed (i.i.d.) sam¬ 
ples when called repeatedly with the same argu¬ 
ments. Exchangeable random primitives, such as crp 
or beta-bernoulli, return a random procedure with 
internal state that generates an exchangeably dis¬ 
tributed samples when called repeatedly. 

In the interpreted version of Anglican the outer <proc> 
in all observe <expr>’s must be a built-in random prim¬ 
itive, which guarantees that likelihood of output given 
arguments can be computed exactly. 


2.2 New Clojure-Style Syntax 

The compiled version of Anglican now supports a syn¬ 
tax that more closely integrates with the host language 
Clojure. The macro defquery defines an Anglican 
program from within Clojure 

II (defquery symbol [argl arg2 . . .] <body>) 

Here symbol is the name of the program, and the ar¬ 
guments may be used to pass values of parameters or 
observed variables to the program. The set of allow¬ 
able body expressions <body> is a subset of the Clojure 
language, in which all basic Clojure language forms 
and first order primitives are supported. Macros and 
higher-order functions are not inherited from Clojure, 
although a subset of macro^ and higher-order func¬ 
tion^ has been implemented. 

In the CPS version of Anglican, random primitives 
such as normal and discrete return first class distribu¬ 
tion objects, rather than sampled values, and, further¬ 
more, are user programmable without requiring mod¬ 
ifications to the Anglican compiler. The special forms 
sample and observe associate values with random vari¬ 
ables drawn from a distribution 

(sample <dist>) 

(observe <dist> <value>) 

The language additionally provides a data type for se¬ 
quences of random variables that are not i.i.d., which 
we refer to as a random process. A random process 
implements two operations 

(produce <process>) 

(absorb <process> <value>) 

The produce primitive returns a distribution object 
for the next random variable in the sequence. The 
absorb primitive returns an updated random process 
instance, in which a value has been associated with 
the next random variable in the sequence. A random 
process is most commonly used to represent an ex¬ 
changeably distributed sequence of random variables, 
but it may be used to represent any sequence of ran¬ 
dom variables for which it is possible to construct a 
distribution on the next variable given preceding vari¬ 
ables. Random process constructors are customarily 
identified with uppercase names (e.g. CRP) and, in the 
same manner as before, are user programmable. 

Finally, the predict form may be used at any point in 
the program to generate labeled output values 

II (predict <label> <expr>) 

®when, cond 

®map, reduce, filter, some, repeatedly, comp, 
partial 
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Given a previously defined query, inference may be 
performed using the doquery macro 

(doquery algorithm <symbol> 

[<argl> <arg2> ...] 

<optl> <opt2> ...) 

The inference algorithm may be specified using an 
algorithm keyword, and any options to the inference 
algorithm can be supplied as arguments as well. The 
doquery macro constructs a lazy sequence of program 
execution states that contain predicted values and op¬ 
tionally an importance weight, which may then be con¬ 
sumed and analyzed by, for instance, an outer Clojure 
or Java program. 

3 Inference 

An execution trace is the sequence of memory states 
resulting from the sequence of function applications 
performed during the interpretation of a program. In 
probabilistic programming systems like Anglican any 
variable may be declared as being the output of a 
random procedure. Such variables can take different 
values in independent interpretations of the program. 
This leads to a “many-worlds” computational trace 
tree in which, at interpretation time, there is a branch 
at every random procedure application. 

To define the probability of a single execution trace, 
first fix an ordering of the exchangeable lines of 
the program and index the observe lines by n. Let 
p{yn\St„,^n) be the likelihood of the observe’d output 
Un where tn is a random procedure type (i.e. gamma, 
poisson, etc.), dt„ is its argument (possibly multi¬ 
dimensional), and x„ is the set of all random procedure 
application results computed before the likelihood of 
observation y„ is evaluated. Both the type and the 
parameter Ot^ can be functions of any in-scope sub¬ 
set of x„. We can then define the probability of an 
execution trace to be 

N 

P(y>x) = J|p(y„|6»t„,x„)p(x„|x„_i) (1) 

n—1 

where y is the set of all observe’d quantities, x is the 
set of all random procedure application results, and ^ 
marks distributions which we can only sample. 

The number and type of all random procedure appli¬ 
cations performed before the nth observe may vary in 
one program trace to the next. We define the proba¬ 
bility of the sequence of their outputs x„ to be 

p(x„|x„_i) (2) 

|x„\x„_i| 

k=l 


Here Xq = 0 is the empty set and Xn,j-.k are the j 
to kth values generated by random procedure applica¬ 
tions in the trace up to observation likelihood compu¬ 
tation n. The cardinality of the set x„ \ x„_i, notated 
I • I, arises implicitly as the total number of random 
procedure applications in a given execution trace. As 
before, 9t^ are the arguments to, and tbe type 
of, the (n, fc)th random procedure - both of which may 
be functions of subsets of in-scope subsets of variables 
2 ^n:i,(fe-i)Ux„_i. Note that x„_i C x„. Also note that 
variable referencing defines a directed conditional de¬ 
pendency structure for the probability model encoded 
by the program, i.e. Xn,k need not (and often cannot 
due to variable scoping) depend on the outputs of all 
previous random procedure applications. 

We use sampling to explore and characterize the distri¬ 
bution p(x|y) cx p(y, x), i.e. the distribution of all ran¬ 
dom procedure outputs that lead to different program 
execution traces, conditioned on observed data. Re¬ 
lated approaches include rejection sampling [3], single¬ 
site MH [21 [12]. 

All members of the set of all directed probabilistic 
models with hxed-structure joint distributions can be 
expressed as probabilistic programs that “unroll” in all 
possible execution traces into an equivalent joint dis¬ 
tribution. As Church-like probabilistic programming 
frameworks, Anglican included, support recursive pro¬ 
cedures and branching on the values returned by ran¬ 
dom procedures, the corresponding set of models is 
a superset of the set of all directed graphical mod¬ 
els. Other related efforts eschew Turing-completeness 
and operate on a restricted set of models [nui Ellin] 
where inference techniques other than sampling can 
more readily be employed. 

3.1 A Ne-w Approach 

Towards our new approach to probabilistic program¬ 
ming inference, first consider a standard sequential 
Monte Carlo (SMC) recursion for sampling from a se¬ 
quence of intermediate distributions that terminates 
in p(x|y) oc p(y,x) where x and y are as before, 
and the joint is given by Eq.’s[2and[^ Note that a 
sequence of intermediate approximating distributions 
can be constructed from any syntactically allowed re¬ 
ordering of 

p(xi)p(x 2 |xi) • • •p(x„|x„_i)p( 2 /i|xi) • • •p(j/„|x„). 

Assume that observation likelihoods are pushed as far 
left in this sequence of approximating distributions as 
possible; however it is clear how to proceed if this is 
not the case. Assume we have 1 < ^ < L unweighted 
samples x[2.j^ ~ p{'^n-i\yi-\n-i)) and that from these 
we will produce approximate samples from p(x„|j/i:„). 
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To do so via importance sampling we may choose 
any proposal distribution q(x„|x„_i, Sampling 
from this and weighting by the discrepancy between 
it and the distribution of interest, p(x„,yi.„), we ar- 
rive at samples with unnormalized weights Wn = 

Here hats notate the 
difference between weighted and unweighted samples, 
those with being weighted and vice versa. 

This expression simplifies substantially in the “propose 
from the prior” case where the proposal distribution 
is defined to be the continued interpretation of the 
program from observation likelihood evaluation n — 1 
ton, i.e. =p(xi^Vn-i)- In this case 

the weight simplifies to Wn ^ = Sampling 

an unweighted particle set ~ 6^{i), where 

/ Y^-Wn \ Completely describes SMC for 
probabilistic program inference. 

The SMC procedure described is, to first approxima¬ 
tion, the inner loop of PMCMC. It corresponds to a 
procedure whereby the probabilistic program is inter¬ 
preted in parallel (possibly each particle in its own 
thread or process) between observation likelihood cal¬ 
culations. Unfortunately, SMC with a finite set of par¬ 
ticles is not itself directly viable for probabilistic pro¬ 
gramming inference for all the familiar reasons: parti¬ 
cle degeneracy, inefficiency in models with global, con¬ 
tinuous parameters, etc. 

PMCMC, on the other hand, is directly viable. PM¬ 
CMC for probabilistic programming inference is a MH 
algorithm for exploring the space of execution traces 
that uses SMC proposals internally. This, unlike prior 
art, allows sampling of execution traces with changes 
to potentially many more than one variable at a time. 
The particular variant of PMCMC we discuss in this 
paper is Particle-Gibbs (PG) although we have devel¬ 
oped engines based on other PMCMG variants includ¬ 
ing particle independent Metropolis Hastings and con¬ 
ditional sequential Monte Garlo too. PG works by it¬ 
eratively re-running SMC, with, on all but the first 
sweep, reinsertion of a “retained” particle trace into 
the set of particles at every stage of SMC. PG is the¬ 
oretically justified as an MH transition operator that, 
like the Gibbs operator, always accepts mm- In this 
paper we describe PMGMG for probabilistic program¬ 
ming inference algorithmically in Algj^and experimen¬ 
tally demonstrate its relative efficacy for probabilistic 
programming inference. 

In Alg. the function r{N,S) stands for multi¬ 
nomial sample N items from the set of pairs S = 
{{wi,9i},... ,{wM,dM}}, where each element of S 
consists of an unnormalized weight Wm and interpreter 
memory states 9m- For each sample value 9m returned. 


the function r also returns the original, corresponding 
unnormalized weight Wm- This kind of weight book¬ 
keeping retains, for all particles, the results of the out¬ 
ermost observe likelihood function applications so that 
the unnormalized weights are available in the retained 
particle {r(;*,x*} in the next sweep. 


Algorithm 1 PMCMC for Prob. Prog. Inference 
L ^ number of particles 
S ^ number of sweeps 
{■u;^\x^^} •(— Run SMC 

for s < S do 

{•,x;^} ^ r(l,{l/L,x^^}) 

{•,Xq^^} ^ initialize L — 1 interpreters 
for d G ordered lines of program do 

for £< L — 1 do 

end for 

if directive(d) == “assume” then 
for f < L — 1 do 

Xn^ G- interpret(d, x|2.]^) 

end for 

{Xn^} ^ U X* 

else if directive(d) == “predict” then 
for f < L — 1 do 
interpret(d, x|2.i) 

end for 

interpret (d, X*_ 2 ) 

else if directive(d) == “observe” then 
for f < L — 1 do 

{Wn\^n'^} ^ interpret (d, X^^lj 

end for 

T ^ r{L- l,{w;i^\xi^^}U{w*,x*}) 

^ TU {■ui*,x;} 

end if 
end for 
end for 


In Alg. [2 “Run SMC” means running one sweep of 
the s loop with L particles and no retained particle, 
d is a program line, and fork(-) means to copy the 
entire interpreter memory datastructure (efficient im¬ 
plementations have characteristics similar to POSIX 
f orkO 0). The command interpret(d, •) means ex¬ 
ecute line d in the given interpreter. Only when in¬ 
terpreting an observe must the interpreter return a 
weight, that being result of the outermost apply of 
the observe statement. Bars indicate temporary data 
structures, not averages. All sets are ordered with 
unions implemented by append operations. 

Note that there are more efficient PMCMC algorithms 
for probabilistic programming inference. In particular, 
there is no reason to fork unless an observe has just 
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been interpreted. Alg. is presented in this form for 
expositional purposes. 


3.2 Random Database 


We refer to the MH approach to sampling over 
the space of all traces proposed in [12] as “random 
database” (RDB). A RDB sampler is a MH sampler 
where a single variable drawn in the course of a partic¬ 
ular interpretation of a probabilistic program is modi¬ 
fied via a standard MH proposal, and this modification 
is accepted by comparing the value of the joint distri¬ 
bution of old and new program traces. For complete¬ 
ness we review RDB here, noting a subtle correction to 
the acceptance ratio proposed in the original reference 
which is proper for a larger family of models. 

The RDB sampler employs a data structure that holds 
all random variables x associated with an execution 
trace, along with the parameters and log probability 
of each draw. Note that interpretation of a program 
is deterministic conditioned on x. A new proposal 
trace is initialized by picking a single variable Xm,j 
from the |x| random draws, and resampling its value 
using a reversible kernel Starting from 

this initialization, the program is rerun to generate a 
new set of variables x' that correspond to a new valid 
execution trace. In each instance where the random 
procedure type remains the same, we reuse the exist¬ 
ing value from the set x, rescoring its log probability 
conditioned on the preceding variables where neces¬ 
sary. When the random procedure type has changed, 
or a new random variable is encountered, its value is 
sampled in the usual manner. Finally, we compute 
the probability p(y|x') by rescoring each observe as 
needed, and accept with probability 


min 


/ p(y|x'Mx')g(x|x') \ 

V ’ p(y|x)p(x)g(x'|x) J 


(3) 


In order to calculate the ratio of the proposal proba¬ 
bilities (/(x'jx) and (/(xjx'), we need to account for the 
variables that were resampled in the course of con¬ 
structing the proposal, as well as the fact that the 
sets x' and x may have different cardinalities |x'| and 
|x|. We will use the (slightly imprecise) notation x'\x 
to refer to the set of variables that were resampled, 
and let x' H x represent the set of variables common 
to both execution traces. The proposal probability is 
now given by 


9 (x'|x) 


p(x'\x I x' n x) 
|x| p(a;',„j|x'nx) ■ 


(4) 


In our implementation, the initialization x'^ ^ is sim¬ 
ply resampled conditioned on the preceding variables, 
such that K{x'j^ j\xm,j) = Pix'^ jlx' n x). The reverse 


proposal density can now be expressed in a similar 
fashion in terms of x\x' and x n x' = x' n x, allowing 
the full acceptance probability to be written as 

p(y|x') p(x') |x| p(x\x' I X n x') 

p(y|x) p(x) |x'| p(x'\x I x' n x) ■ 

4 Testing 

Programming probabilistic program interpreters is a 
non-trivial software development effort, involving both 
the correct implementation of an interpreter and the 
correct implementation of a general purpose sampler. 
The methodology we employ to ensure correctness of 
both involves three levels of testing; 1) unit tests, 2) 
measure tests, and 3) conditional measure tests. 

4.1 Unit and Measnre Tests 

In the context of probabilistic programming, unit test¬ 
ing includes verifying that the interpreter correctly 
interprets a comprehensive set of small deterministic 
programs. Measure testing involves interpreting short 
revealing programs consisting of assume and predict 
statements (producing a sequence of ancestral, uncon¬ 
ditioned samples, i.e. no observe’s). Interpreter out¬ 
put is tested relative to ground truth, where ground 
truth is computed via exhaustive enumeration, ana¬ 
lytic derivation, or some combination, and always in 
a different, well-tested independent computational sys¬ 
tem like Matlab. Various comparisons of the empirical 
distribution constructed from the accumulating stream 
of output predicts’s and ground truth are computed; 
Kulback-Leibler (KL) divergences for discrete sample 
spaces and Kolmogorov Smirnov (KS) test statistics 
for continuous sample spaces. While it is possible 
to construct distribution equality hypothesis tests for 
some combinations of test statistic and program we 
generally are content to accept interpreters for which 
there is clear evidence of convergence towards zero of 
all test statistics for all measure tests. Anglican passed 
all unit and measure tests. 

4.2 Conditional Measure Tests 

Measure tests involving conditioning provide addi¬ 
tional information beyond that provided by measure 
and unit tests. Conditioning involves endowing pro¬ 
grams with observe statements which constrain or 
weight the set of possible execution traces. Interpret¬ 
ing observe statements engages the full inference ma¬ 
chinery. Conditional measure test performance is mea¬ 
sured in the same way as measure test performance. 
They are also how we compare different probabilistic 
programming inference engines. 
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Figure 1: Comparative conditional measure test performance: PMCMC with 100 particles vs. RDB. 


5 Inference Engine Comparison 


respond to expressive probabilistic graphical models 
with rich conditional dependencies. 


We compare PMCMC to RDB measuring convergence 
rates for an illustrative set of conditional measure test 
programs. Results from four such tests are shown in 
Figure where the same program is interpreted using 
both inference engines. PMCMC is found to converge 
faster for conditional measure test programs that cor¬ 


The four test programs are: 1) a program that corre¬ 
sponds to state estimation in a hidden Markov model 
(HMM) with continuous observations (HMM: Pro¬ 
gram Figure la), 2) a program that corresponds 
to learning an uncollapsed Dirichlet process (DP) mix- 
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ture of Gaussians with fixed hyperparameters (DP 
Mixture: Program 5.2 Figure lb), 3) a multimodal 
branching with deterministic recursion program that 
cannot be represented as a graphical model in which all 
possible execution paths can be enumerated (Branch¬ 
ing: Program 5.3 Figure Ic), and 4) a program that 
corresponds to inferring the mean of a univariate nor¬ 
mal generated via an Anglican-coded Marsaglia [6] 
rejection-sampling algorithm that halts with probabil¬ 
ity one and generates an unknown number of inter¬ 
nal random variables (Marsaglia: Program 5.4 Fig¬ 
ure Id). We refer to 1) and 2) as “expressive” mod¬ 


els because they have complex conditional dependency 
structures and 3) and 4) as simple models because the 
programs encode models with very few free parame¬ 
ters. 1) and 2) illustrate our claims; 3) and 4) are in¬ 
cluded to document the correctness and completeness 
of the Anglican implementation while also demonstrat¬ 
ing that the gains illustrated in 1) and 2) do not come 
at too great a cost even for simple programs for which, 
a priori, PMCMC might be reasonably be expected to 
underperform. 


In Figures [Tap ] there are three panels that report sim¬ 
ilar style findings across test programs and a fourth 
that is specific to the individual test program. In all, 
PMCMC results are reported for a single-threaded in¬ 
terpreter with 100 particles. The choice of 100 par¬ 
ticles is largely arbitrary; our results are stable for a 
large range of values. PMCMC is dark blue while RDB 
is light orange. We report the 25% (lower dashed) 
median (solid) and 75% (upper dashed) percentiles 
over 25 runs with differing random number seeds. We 
refer to KL/KL*/KS as distances and compute each 
via a running average of the empirical distribution of 
predict statement outputs to ground truth starting at 
the first predict output. Note that lower is better. 
We define the number of simulations to be the num¬ 
ber of times the program is interpreted in its entirety. 
For RDB this means that the number of simulations 
is exactly the number of sampler sweeps; for PMCMC 
it is the number of particles multiplied by the number 
of sampler sweeps. The time horizontal axes report 
wall clock time; the apply axes report the number of 
function applications performed by the interpreter. In 
the distance vs. time plots, observed single-threaded 
PMCMC wall-clock times are reported via filled cir¬ 
cles; the left-ward dotted lines illustrate hypothetically 
what should be achievable via parallelism. Carefully 
note that PMCMC requires completing a number of 
simulations equal to the number of particles (here 100) 
before emitting batched predict outputs. This means 
that single-threaded implementations of PMCMC suf¬ 
fer from latency that RDB does not. Still, for some 
programs both the quality of PMCMC’s predict out¬ 
puts and PMCMC’s convergence rate is faster even 


in direct wall clock time comparison to RDB. PM¬ 
CMC appears to converge faster for some programs 
than RDB even relative to the number of function ap¬ 
plications. Equivalent results were obtained relative 
to eval counts. 

5.1 HMM 

New 

(defquery hmm 

[observations init-dist trans-dists obs-dists] 

(predict 
:states 
(reduce 

(fn [states obs] 

(let [state (sample (get trans-dists 

(peek states)))] 

(observe (get obs-dists state) obs) 

(conj states state))) 

[(sample init-dist)] 
observations))) 

Original (deprecated) 

[assume initial-state-dist (list (/ 13) (/I 3) (/ 1 3))] 

[assume get-state-transition-dist (lambda (s) 

(cond ((= s 0) (list .1 .5 .4)) ((= s 1) (list .2 .2 .6)) 

((= s 2) (list .15 .15 .7))))] 

[assume transition (lambda (prev-state) 

(discrete (get-state-transition-dist prev-state)))] 

[assume get-state (mem (lambda (index) 

(if (<= index 0) (discrete initial-state-dist) 

(transition (get-state (- index 1))))))] 
[assume get-state-observation-mean (lambda (s) 

(cond ((= s 0) -1) ((= s 1) 1) ((= s 2) 0)))] 

[observe (normal (get-state-obs-mean (get-state 1)) 1) .9] 
[observe (normal (get-state-obs-mean (get-state 2)) 1) .8] 

[observe (normal (get-state-obs-mean (get-state 16)) 1) -1] 

[predict (get-state 0)] 

[predict (get-state 1)] 

[predict (get-state 16)] 

The HMM program corresponds to a latent state in¬ 
ference problem in an HMM with three states, one¬ 
dimensional Gaussian observations (.9, .8, .7, 0, -.025, 
5, 2, 0.1, 0, .13, .45, 6, .2, .3, -1, -1), with known means 
and variances, transition matrix, and initial state dis¬ 
tribution. The lines of the program were organized 
with the observe’s in “time” sequence. 

The KL* axis reports the sum of the Kulback-Leibler 
divergences between the running sample average state 
occupancy across all states of a HMM including the 
initial state and one trailing predictive state KL*g = 

Here 6^(s){k) returns one 
if simulation s has latent state z at time step i equal 
to k and jiik) is the true marginal probability of the 
latent state indicator Zj taking value k at time step 
i. The vertical red line in the apply plot indicates the 
time and number of applies it takes to run forward- 
backward in the Anglican interpreter. 

The fourth plot shows the learned posterior distribu¬ 
tion over the latent state value for all time steps in¬ 
cluding both the initial state and a trailing predictive 
time step. While RDB produces a reasonable approx¬ 
imation to the true posterior, it does so more slowly 
and with greater residual error. 
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5.2 DP Mixture 

New 

(defquery crp-mixture 

[observations alpha mu beta a b] 

(let [precision-prior (gamma a b)] 

(loop [observations observations 
state-proc (CRP alpha) 
obs-dists -C} 
states [] ] 

(if (empty? observations) 

(do 

(predict :states states) 

(predict :num-clusters (count obs-dists))) 

(let [state (sample (produce state-proc)) 
obs-dist (get obs-dists 
state 

(let [1 (sample precision-prior) 
s (sqrt (/ (* beta 1))) 
m (sample (normal mu s))] 
(normal m (sqrt (/ 1)))))] 
(observe obs-dist (first observations)) 

(recur (rest observations) 

(absorb state-proc state) 

(assoc obs-dists state obs-dist) 

(conj states state))))))) 


Original (deprecated) 

[assume class-generator (crp 1.72)] 

[assume class (mem (lambda (n) (class-generator)))] 

[assume var (mem (lambda (c) (* 10 (/ 1 (gamma 1 10)))))] 

[assume mean (mem (lambda (c) (normal 0 (var c))))] 

[assume u (lambda () (list (class 1) (class 2) ... 

(class 9) (class 10)))] 

[assume K (lambda () (count (unique (u))))] 

[assume means (lambda (i c) 

(if (= i c) (list (mean c)) 

(cons (mean i) (means (+ i 1) c) )))] 

[assume stds (lambda (i c) 

(if (= i c) (list (sqrt (* 10 (var c)))) 

(cons (var i) (stds (+ i 1) c) )))] 

[observe (normal (mean (class 1)) (var (class 1))) 1.0] 
[observe (normal (mean (class 2)) (var (class 2))) 1.1] 

[observe (normal (mean (class 10)) (var (class 10))) 0] 
[predict (u)] 

[predict (K)] 

[predict (means 1 (K))] 

[predict (stds 1 (K))] 

The DP mixture program corresponds to a clustering 
with unknown mean and variance problem modelled 
via a Dirichlet process mixture of one-dimensional 
Gaussians with unknown mean and variance (normal- 
gamma priors). The KL divergence reported is be¬ 
tween the running sample estimate of the distribution 
over the number of clusters in the data and the ground 
truth distribution over the same. The ground truth 
distribution over the number of clusters was computed 
for this model and data by exhaustively enumerating 
all partitions of the data (1.0, 1.1, 1.2, -10, -15, -20, 
.01, .1, .05, 0), analytically computing evidence terms 
by exploiting conjugacy, and conditioning on partition 
cardinality. The fourth plot shows the posterior distri¬ 
bution over the number of classes in the data computed 
by both methods relative to the ground truth. 

This program was written in a way that was inten¬ 
tionally antagonistic to PMCMC. The continuous like¬ 
lihood parameters were not marginalized out and the 
observe statements were not organized in an optimal 
ordering. Despite this, PMCMC outperforms RDB per 
simulation, wall clock time, and apply count. 


5.3 Branching 

New 

(defn fib [n] 

(loop [a 0 b 1 m 0] 

(if (= m n) 
a 

(recur b (+ a b) (inc m))))) 

(with-primitive-pr 0 cedures [fib] 

(defquery branching [] 

(let [count-prior (poisson 4) 
r (sample count-prior) 

1 (if (< 4 r) 

6 

(+ (fib (* 3 r)) 

(sample count-prior)))] 

(observe (poisson 1) 6) 

(predict :r r)))) 

Original (deprecated) 

[assume fib (lambda (n) 

(cond ((= n 0) 1) ((= n 1) 1) 

(else (+ (fib (- n 1)) (fib (- n 2))))))] 

[assume r (poisson 4)] 

[assume 1 (if (< 4 r) 6 (+ (fib (* 3 r)) (poisson 4)))] 
[observe (poisson 1) 6] 

[predict r] 

The branching program has no corresponding graph¬ 
ical model. It was designed to test for correctness of 
inference in programs with control logic and execution 
paths that can vary in the number of sampled values. 
It also illustrates mixing in a model where, as shown 
in the fourth plot, there is a large mismatch between 
the prior and the posterior, so rejection and impor¬ 
tance sampling are likely to be ineffective. Because 
there is only one observation and just a single named 
random variable PMCMC and RDB should and does 
achieve essentially indistinguishable performance nor¬ 
malized to simulation, time and apply count. 

5.4 Marsaglia 

New 

(defra marsaglia-normal [mu std] 

(let [u (uniform-continuous -1.0 1.0)] 

(loop [x (sample u) 
y (sample u)] 

(let [s (+ (* X x) (♦ y y))] 

(if (< s 1.0) 

(+ mu (* std (* X (sqrt (* -2.0 (/ (log s) s)))))) 
(recur (sample u) (sample u))))))) 

(defquery gaussian-marsaglia 

[observations sigma muO sigmaO] 

(let [mu (marsaglia-normal rauO sigmaO) 
likelihood (normal mu sigma)] 

(reduce (fn [_ obs] 

(observe likelihood obs)) 
nil 

observations) 

(predict :mu mu))) 

Original (deprecated) 

[assume marsaglia-normal 
(lambda (mu std) 

(define x (uniform-continuous -1.0 1.0)) 

(define y (uniform-continuous -1.0 1.0)) 

(define s (+ (* x x) (* y y))) 

(if (< s 1) 

(+ mu (* std (* X (sqrt (* -2.0 (/ (log s) s)))))) 
(marsaglia-normal mu std)))] 

[assume std (sqrt 2)] 

[assume mu (marsaglia-normal 1 (sqrt 5))] 

[observe (normal mu std) (+ 8 1)] 

[observe (normal mu std) 8] 

[predict mu] 
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Marsaglia is a test program included here for com¬ 
pleteness. It is an example of a type of program for 
which PMCMC sometimes may not be more efficient. 
Marsaglia is the name given to the rejection form of the 
Box-Muller algorithm [5] for sampling from a Gaus¬ 
sian [5]. The Marsaglia test program corresponds to 
an inference problem in which observed quantities are 
drawn from a Gaussian with unknown mean and this 
unknown mean is generated by an Anglican implemen¬ 
tation of the Marsaglia algorithm for sampling from a 
Gaussian. The KS axis is a Kolmogorov-Smirnov test 
statistic [S] computed by finding the maximum devia¬ 
tion between the accumulating sample and analytically 
derived ground truth cumulative distribution functions 
(GDF). Equal-cost PMCMC, RDB, and ground truth 
CDFs are shown in the fourth plot. 

Because Marsaglia is a recursive rejection sampler it 
may require many recursive calls to itself. We conjec¬ 
ture that RDB may be faster than PMCMC here be¬ 
cause, while PMCMC pays no statistical cost, it does 
pay a computational cost for exploring program traces 
that include many random procedure calls that lead to 
rejections whereas RDB, due to the implicit geomet¬ 
ric prior on program trace length, effectively avoids 
paying excess computational costs deriving from un¬ 
necessarily long traces. 



Figure 2: Effect of program line permutations 


5.5 Line Permutation 

Syntactically and semantically observe and predict’s 
are mutually exchangeable (so too are assume’s up to 
syntactic constraints). Given this and the nature of 
PMCMC it is reasonable to expect that line permu¬ 
tations could effect the efficiency of inference. We ex¬ 
plored this by randomly permuting the lines of the 
HMM and DP Mixture programs. The results are 
shown in Fig. where blue lines correspond to median 
(out of twenty five) runs of PMCMC for each of twenty 
five program line permutations (including unmodified 
(dark) and reversed (light)) and orange are the same 
for RDB. For the HMM we found that the natural, 
time-sequence ordering of the lines of the program re¬ 
sulted in the best performance for PMCMC relative 
to RDB. This is because in this ordering the observe’s 
cause re-weighting to happen as soon as possible in 
each SMC phase of PMCMC. The effect of permut¬ 
ing code lines interpolates inference performance be¬ 
tween optimal where PMCMC is best and adversarial 
orderings where RDB is instead. RDB performance is 
demonstrated here to be independent of the program 
line ordering. 

The DP Mixture results show PMCMC outperforming 
RDB on all program reorderings. Further, it can be 
seen that the original program ordering was not opti¬ 
mal with respect to PMCMC inference. 

While PMCMC presents the opportunity for signifi¬ 
cant gains in inference efficiency, it does not prevent 
programmers from seeking to further optimize perfor¬ 
mance manually. Programmers can influence inference 
performance by reordering program lines, in particu¬ 
lar pushing observe statements as near to the front 
of the program as syntactically allowed, or restructur¬ 
ing programs to lazily rather than eagerly generate 
latent variables. Efficiency gains via automatic trans¬ 
formations or online adaptation of the ordering may 
be possible. 



Figure 3: Effect of particle count on performance 


5.6 Number of Particles 

Fig. [3] shows the number of particles in the PMCMC 
inference engine affects performance. Performance im¬ 
proves as a function of the number of particles. In this 
plot the red line indicates 100 particles. Increasing 
from dark to light the number of particles plotted is 2, 
5, 10, 20, 50, (red, 100), 200, and 500. 

6 Discussion 

The PMCMC approach to probabilistic program in¬ 
terpretation appears to converge faster to the true 
conditiona distribution over program execution traces 
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for programs that correspond to expressive models 
with dense conditional dependencies than single-site 
Metropolis-Hastings methods. That PMCMC con¬ 
verges faster in our tests even after normalizing for 
computational time is, to us, both surprising and strik¬ 
ing. To reiterate, this includes all the computation 
done in all the particle executions. 

Current versions of Anglican include a multi-threaded 
inference core but work remains to achieve optimal 
parallelism due, in particular, to memory organiza¬ 
tion sub-optimality leading to excessive locking over¬ 
head. Using just a single thread, PMCMC surprisingly 
sometimes outperforms RDB per simulation anyway in 
terms of wall clock time. 

Our specific choice of syntactically forcing observe’s 
to be noisy requires language and interpreter level re¬ 
strictions and checks. Hard constraint observe’s can 
be supported in Anglican by exposing a Dirac like¬ 
lihood to programmers. Persisting in not doing so 
should help programmers avoid writing probabilistic 
programs where finding even a single satisfying exe¬ 
cution trace is NP-hard or not-computable, but also 
could be perceived as requiring a non-intuitive pro¬ 
gramming style. 

As explored in concurrent work on Venture, there may 
be opportunities to improve inference performance by 
partitioning program variables, either automatically or 
via syntax constructs, and treating them differently 
during inference. Then, for instance, some could be 
sampled via plain MH, some via conditional SMC. One 
simple way to do this would be to combine conditional 
SMC and RDB MH. 
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